| Player | Club | Age | Position | Nation | Value | League | Gls/90 (20/21) | Touches (20/21) |
|---|---|---|---|---|---|---|---|---|
| Neymar | Paris Saint-Germain | 29 | attack | Brazil | 90000000 | Ligue 1 | 0.6 | 1337 |
| Leandro Trossard | Brighton & Hove Albion | 26 | attack | Belgium | 15300000 | Premier League | 0.2 | 1421 |
| Lukasz Skorupski | Bologna FC 1909 | 30 | Goalkeeper | Poland | 3420000 | Serie A | 0.0 | 1054 |
| Ben White | Arsenal FC | 23 | Defender | England | 25200000 | Premier League | 0.0 | 2064 |
| Danilo Cataldi | SS Lazio | 27 | midfield | Italy | 2250000 | Serie A | 0.0 | 470 |
Predictive Factors of Player Market Value in World Football
1. Introduction
The business model of professional European football has reached a point of reckoning. The hypercompetitive open-league system with no salary cap has created an environment in which sustained profit is rarely a realistic option. Instead, the increased financialization of European football has seen clubs continue to spend beyond their means in an arms race for talent. This has driven rapid growth in player spending, which reached more than £6.5bn in the summer 2023 transfer window [1].
This study aims to better understand what factors influence market value in players to help clubs move towards more sustainable transfer models. By analyzing what factors contribute towards player market value, clubs can be more prudent in their player trading and academy development philosophies. The project investigates whether factors such as nationality, age, and position are related to player market value.
To address this question, I used a data set from Kaggle that scraped information on global soccer players from Transfermarkt and fbref.com [2]. The data was last updated on September 2nd, 2021. Each case in the data set is a different player. Table 1 displays a snapshot of 5 randomly chosen rows of the data set being used.
2. Exploratory data analysis
| Age | Value | Gls/90 (20/21) | Touches (20/21) |
|---|---|---|---|
| Min. :16.00 | Min. : 90000 | Min. :0.0000 | Min. : 0.0 |
| 1st Qu.:24.00 | 1st Qu.: 2250000 | 1st Qu.:0.0000 | 1st Qu.: 509.0 |
| Median :27.00 | Median : 5850000 | Median :0.0500 | Median : 979.5 |
| Mean :26.74 | Mean : 11688385 | Mean :0.1228 | Mean :1040.9 |
| 3rd Qu.:30.00 | 3rd Qu.: 15300000 | 3rd Qu.:0.1700 | 3rd Qu.:1467.2 |
| Max. :43.00 | Max. :144000000 | Max. :1.5000 | Max. :3543.0 |
The original sample was 2,075 players. I filtered the dataset down to the variables I was interested in analyzing. Some players had missing values for the selected variables, so I dropped these rows from consideration. I do not know why these values were missing, so I cannot comment on the impact that dropping these rows might have on my results.
After cleaning the data, my total sample size was 1,910 players. The summary (Table 2) shows that within this filtered dataset, the average age is 26.74, the average player value is 11,688,385 EUR, the average Goals/90 in 2020-21 is 0.1228, and the average number of touches in 2020-21 is 1040.9. I then explored players by nation (Fig 1), where I observed that the five countries with the most players were Spain (274), France (258), England (176), Germany (174), and Italy (152). I also explored the highest valued players by league (Table 3).
| Player | Club | Age | Position | Nation | Value | League | Gls/90 (20/21) | Touches (20/21) |
|---|---|---|---|---|---|---|---|---|
| Kylian Mbappe | Paris Saint-Germain | 22 | attack | France | 144000000 | Ligue 1 | 1.0 | 1442 |
| Erling Haaland | Borussia Dortmund | 21 | attack | Norway | 117000000 | Bundesliga | 1.0 | 762 |
| Harry Kane | Tottenham Hotspur | 28 | attack | England | 108000000 | Premier League | 0.7 | 1396 |
| Frenkie de Jong | FC Barcelona | 24 | midfield | Netherlands | 81000000 | La Liga | 0.1 | 3090 |
| Lautaro Martinez | Inter Milan | 24 | attack | Argentina | 72000000 | Serie A | 0.6 | 1026 |
The distribution of player values (Fig 2) was positively skewed, so I applied a logarithmic transformation (Fig 3). There was also a potential outlier at the top end, at roughly 144 million EUR (the maximum in Table 2), which is important to keep in mind throughout my analysis.
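The effect of the transformation can be illustrated from the quartiles already reported in Table 2. The following is a minimal Python sketch, not the R workflow used for this report; it uses Bowley's quartile skewness as one convenient summary, and it assumes the quartiles of log value are approximately the logs of the reported quartiles.

```python
import math

# Quartiles of Value (EUR) as reported in the summary table (Table 2).
q1, median, q3 = 2_250_000, 5_850_000, 15_300_000

def bowley_skewness(q1, q2, q3):
    """Quartile skewness: 0 for a symmetric middle half, > 0 for right skew."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

raw_skew = bowley_skewness(q1, median, q3)
# log is monotone, so the quartiles of log(Value) are (approximately)
# the logs of the quartiles of Value.
log_skew = bowley_skewness(math.log(q1), math.log(median), math.log(q3))

print(f"raw: {raw_skew:.3f}, log: {log_skew:.3f}")
```

On the raw scale the upper quartile sits much further from the median than the lower quartile does, whereas on the log scale the two gaps are nearly equal, which is consistent with the more symmetric shape seen in Fig 3.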
In Figure 4, I generated a scatterplot to see the overall relationship between the numerical outcome variable Player Value and the numerical explanatory variable Age. As the age of players increased, there was an associated increase in Player Value until the age of ~25, after which Player Value appeared to decrease. The overall correlation coefficient is -0.3; because a correlation coefficient only measures linear association, it captures the overall downward tendency but not the curved shape of this relationship.
# A tibble: 1 × 1
cor
<dbl>
1 -0.3
Figure 5 shows the relationship between the numerical outcome variable Player Value and another numerical explanatory variable Goals per 90 mins (20/21). As the number of goals per 90 minutes increased, there was an associated increase in Player Value. This relationship is represented by a correlation coefficient of 0.28.
# A tibble: 1 × 1
cor
<dbl>
1 0.28
Figure 6 shows the relationship between the numerical outcome variable Player Value and the categorical explanatory variable League. Player Value looks to be the greatest for the Premier League, with a large gap down to Serie A next, then La Liga, the Bundesliga, and finally Ligue 1. There is a large outlier for Ligue 1 (Kylian Mbappe), whilst the Premier League, the Bundesliga, and Serie A each have a number of players who are low outliers.
Bringing these together in Figures 8 and 9, I created two scatterplots exploring the relationship between all three variables. Each graph simulates an interaction model in which every league has its own regression line with a slightly different slope. Given that the slopes are quite similar and the regression lines are close to parallel in both plots, my EDA might favor a simpler parallel slopes model over an interaction model.
3. Multiple linear regression
3.1 Methods
I chose the components of my multiple linear regression as the following:
Outcome variable \(y\) = Log Player Value
Numerical explanatory variable \(x_1\) = Age
Numerical explanatory variable \(x_2\) = Goals/90 min
Categorical explanatory variable \(x_3\) = League
The unit of analysis is the individual player, and the outcome is player value measured in EUR. Because I have used a logarithmic transformation, I will interpret the results in percentage terms where applicable. I created 4 different models, each capturing a different relationship between the 3 explanatory variables and the outcome.
3.2 Model Results
| Model | df | AIC |
|---|---|---|
| m1 | 8 | 5741.36 |
| m2 | 9 | 5578.53 |
| m3 | 9 | 5742.15 |
| m4 | 11 | 5581.57 |
Given that the AIC is the lowest for model 2 (Table 4), I chose to analyze this model for the rest of the analysis.
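The comparison can be made explicit by computing each model's AIC difference from the best model. This is a small Python sketch using only the values in Table 4; it assumes the df column is the parameter count k in AIC = 2k - 2 logLik (as in R's `AIC()` output), and it is an illustration rather than the code used to fit the models.

```python
# AIC and df values copied from Table 4.
aic = {"m1": 5741.36, "m2": 5578.53, "m3": 5742.15, "m4": 5581.57}
df = {"m1": 8, "m2": 9, "m3": 9, "m4": 11}

# Lower AIC is better; report every model relative to the best one.
best = min(aic, key=aic.get)
delta = {m: round(a - aic[best], 2) for m, a in aic.items()}

# AIC = 2k - 2*logLik, so the log-likelihood can be backed out of the table
# (assumption: df is the parameter count k).
loglik = {m: (2 * df[m] - aic[m]) / 2 for m in aic}

print(best, delta)
```

Model 4 comes within about 3 AIC units of model 2 but uses two more parameters, while models 1 and 3 are more than 160 units worse, so the choice of model 2 is not sensitive to small changes in the AIC values.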
| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 7.39541 | 0.00000 | 5.84890 | 8.94192 |
| Age | 0.66928 | 0.00000 | 0.55422 | 0.78434 |
| I(Age^2) | -0.01403 | 0.00000 | -0.01613 | -0.01193 |
| Gls/90 (20/21) | 2.09663 | 0.00000 | 1.83591 | 2.35736 |
| LeagueBundesliga | -0.00151 | 0.98455 | -0.15422 | 0.15120 |
| LeagueLa Liga | 0.28702 | 0.00023 | 0.13469 | 0.43934 |
| LeaguePremier League | 0.87602 | 0.00000 | 0.72767 | 1.02438 |
| LeagueSerie A | 0.28117 | 0.00023 | 0.13158 | 0.43075 |
3.3 Interpreting the regression table
The Model 2 regression equation for Log Player Value is as follows:
\(\begin{gather} \widehat{\log \text{player value}_i} = 7.39541 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i)\\ -0.00151 \cdot 1_{\text{is Bundesliga}}(x_3) + 0.28702 \cdot 1_{\text{is La Liga}}(x_3) + 0.87602 \cdot 1_{\text{is Premier League}}(x_3) + 0.28117 \cdot 1_{\text{is Serie A}}(x_3) \end{gather}\)
- The intercept [exp(7.39541) = €1,628.49] represents the predicted Player Value for a player aged 0 who scores 0 goals/90 minutes and plays in Ligue 1 (Table 5). Age 0 lies far outside the data (the minimum age is 16), so the intercept has no practical interpretation on its own.
- The estimate for the linear part of age [exp(0.66928) = 1.952831] cannot be interpreted separately from the quadratic term. Taken alone, it would imply an associated increase in Player Value of 95.28% on average for every extra year of age, but that describes the slope only at age 0. The negative quadratic term indicates a concave relationship, meaning the effect of age on Player Value is not linear: it is positive for younger players, shrinks as age increases, and eventually turns negative. For example, when age goes from 19 to 20 we see an associated increase in log player value (Figure 10), whereas when age goes from 29 to 30 we observe an associated decrease (Figure 10), holding all other variables constant.
- The estimate for the slope for Goals/90 min for 2020-21 [exp(2.09663) = 8.138696] is the multiplicative change in average Player Value associated with each unit increase in goals/90 minutes in 2020-21. Based on this estimate, for every extra 1 goal/90 minutes a player scores, there was an associated increase in Player Value of 713.87% on average.
- The estimates for LeagueBundesliga [exp(-0.00151) = 0.9984911], LeagueLa Liga [exp(0.28702) = 1.332451], LeaguePremier League [exp(0.87602) = 2.401323], and LeagueSerie A [exp(0.28117) = 1.324679] are the offsets in intercept relative to the intercept of the baseline group, Ligue 1 (Table 5). In other words, holding age and goals/90 minutes constant, Bundesliga players are on average valued 0.15% lower than Ligue 1 players, La Liga players 33.25% higher, Premier League players 140.13% higher, and Serie A players 32.47% higher.
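The percentage figures quoted above all come from the same conversion, 100 × (exp(β) − 1). The Python sketch below reproduces them from the Table 5 estimates; it is an illustration rather than the R code behind the report, and the helper name `pct_change` and the 0.1 goals/90 step are my own choices. Age is left out because its effect depends on the quadratic term.

```python
import math

# Model 2 estimates on the log scale, copied from Table 5.
coefs = {
    "Gls/90 (20/21)": 2.09663,
    "LeagueBundesliga": -0.00151,
    "LeagueLa Liga": 0.28702,
    "LeaguePremier League": 0.87602,
    "LeagueSerie A": 0.28117,
}

def pct_change(beta, step=1.0):
    """Percent change in value for a `step` increase in the predictor."""
    return (math.exp(beta * step) - 1) * 100

pct = {term: round(pct_change(b), 2) for term, b in coefs.items()}

# A 0.1 goals/90 step is closer to the observed spread (the mean is 0.1228).
gls_tenth = round(pct_change(coefs["Gls/90 (20/21)"], step=0.1), 1)

print(pct)
print(gls_tenth)
```

Because a full extra goal per 90 minutes is a very large change relative to the data (Table 2 reports a mean of 0.1228), the smaller step of 0.1 goals/90, associated with an increase of roughly 23%, may be the more practical way to read this coefficient.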
Thus the five regression lines are:
\(\begin{align}\widehat{\log \text{player value Ligue 1}_i} &= 7.39541 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i) \\\widehat{\log \text{player value Bundesliga}_i} &= 7.39390 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i) \\\widehat{\log \text{player value La Liga}_i} &= 7.68243 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i) \\\widehat{\log \text{player value Premier League}_i} &= 8.27143 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i) \\\widehat{\log \text{player value Serie A}_i} &= 7.67658 + 0.66928(Age_i) - 0.01403 (Age_i)^2 + 2.09663(Goals/90_i)\end{align}\)
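As a check on these equations, the fitted values can be recomputed from the Table 5 estimates and compared with the `.fitted` column reported in Table 6. This is a Python sketch rather than the R code used for the report; the function name `predict_log_value` is hypothetical, and small differences are expected because both the coefficients and the goals/90 values are rounded.

```python
import math

# Model 2 estimates from Table 5; Ligue 1 is the baseline league.
INTERCEPT, B_AGE, B_AGE2, B_GLS = 7.39541, 0.66928, -0.01403, 2.09663
LEAGUE_OFFSET = {
    "Ligue 1": 0.0,
    "Bundesliga": -0.00151,
    "La Liga": 0.28702,
    "Premier League": 0.87602,
    "Serie A": 0.28117,
}

def predict_log_value(age, gls90, league):
    """Fitted log player value from the rounded Table 5 coefficients."""
    return (INTERCEPT + LEAGUE_OFFSET[league]
            + B_AGE * age + B_AGE2 * age ** 2 + B_GLS * gls90)

# First rows of Table 6: (age, goals/90, league, reported .fitted)
rows = [
    (22, 1.02, "Ligue 1", 17.46948),
    (21, 1.01, "Bundesliga", 17.38085),
    (28, 0.67, "Premier League", 17.41952),
]
preds = [predict_log_value(age, gls, league) for age, gls, league, _ in rows]
for (age, gls, league, fitted), pred in zip(rows, preds):
    # exp() converts the prediction back to EUR
    print(league, round(pred, 3), fitted, round(math.exp(pred)))
```

The recomputed values agree with Table 6 to within about 0.003 on the log scale, which supports the positive sign on the Age term and the 2.09663 coefficient on goals/90 minutes.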
3.4 Inference and hypothesis testing
Using the output of my regression table I decided to test three different null hypotheses:
3.4.1. Age and Player Value
Null Hypothesis (H0): There is no relationship between Age and Logged Player Value at the population level (i.e. all population slopes are zero). Alternative Hypothesis (HA): There is a relationship between age and logged player value at the population level.
\(\begin{align} H_0: \beta_{\text{Age}} = 0 \end{align}\)
\(\begin{align} H_a: \beta_{\text{Age}} \neq 0 \end{align}\)
The overall correlation between age and player value is negative (correlation coefficient of -0.3). Table 5 shows us that:
The 95% confidence interval for the population slope of Age is (0.55422, 0.78434), positive at both ends, and the interval for the population slope of Age^2 is (-0.01613, -0.01193), negative at both ends. Neither of these intervals includes zero, suggesting that Age and Age^2 both have a statistically significant association with log player value.
The p-value is very small (\(p\) < 0.001) for both Age and Age^2, which is below the 0.05 significance level, so we reject the null hypothesis \(H_0\) that there is no relationship between player value and age in favor of the alternative hypothesis \(H_a\) that there is a relationship. The signs of the two terms indicate that this relationship is concave rather than simply negative.
This means that, taking into account potential sampling variation in results, the relationship between age and log player value appears to be positive for younger players and negative for older players.
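The age at which the fitted relationship changes direction follows from the two age coefficients: a quadratic \(b_1 \cdot Age + b_2 \cdot Age^2\) peaks at \(-b_1 / (2 b_2)\). The Python sketch below works this out from the rounded Table 5 estimates; it is an illustration, not part of the original R analysis, and `pct_change_one_year` is a hypothetical helper name.

```python
import math

B_AGE, B_AGE2 = 0.66928, -0.01403  # Age and I(Age^2) estimates, Table 5

# Vertex of the fitted quadratic: age at which predicted log value peaks.
peak_age = -B_AGE / (2 * B_AGE2)

def pct_change_one_year(age):
    """Percent change in predicted value moving from `age` to `age + 1`."""
    delta_log = B_AGE + B_AGE2 * ((age + 1) ** 2 - age ** 2)
    return (math.exp(delta_log) - 1) * 100

up_19_to_20 = pct_change_one_year(19)
down_29_to_30 = pct_change_one_year(29)

print(round(peak_age, 2))
print(round(up_19_to_20, 1), round(down_29_to_30, 1))
```

With these estimates the turning point falls at roughly age 24, a one-year step from 19 to 20 is associated with an increase of about 13%, and a step from 29 to 30 with a decrease of about 15%, holding goals/90 minutes and league constant.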
3.4.2. Goals/90 Mins and Player Value
Null Hypothesis (H0): There is no relationship between goals/90 mins and Logged Player Value at the population level (i.e. the population slope is zero). Alternative Hypothesis (HA): There is a relationship between goals/90 mins and logged player value at the population level.
\(\begin{align} H_0: \beta_{\text{Goals/90}} = 0 \end{align}\)
\(\begin{align} H_a: \beta_{\text{Goals/90}} \neq 0 \end{align}\)
There appears to be a positive relationship between goals/90 mins and player value, as represented by the correlation coefficient of 0.28. Table 5 shows us that:
The 95% confidence interval for the population slope of goals/90 mins is (1.83591, 2.35736), positive at both ends. Since this interval does not include zero, we have evidence to suggest that goals/90 mins has a statistically significant association with log player value.
The p-value for goals/90 mins is very small (\(p\) < 0.001), which is below the 0.05 significance level, so we reject the null hypothesis \(H_0\) that there is no relationship between player value and goals/90 mins in favor of the alternative hypothesis \(H_a\) that there is a (positive) relationship.
This means that taking into account potential sampling variation in results, the relationship between goals/90 mins and log player value appears to be positive.
3.4.3. Leagues and Player Value
Null Hypothesis (H0): The differences in intercepts for the non-baseline groups (Bundesliga, La Liga, Serie A, Premier League) are zero. Alternative Hypothesis (HA): The differences in intercepts for non-baseline groups are not equal to zero.
\(\begin{align} H_0: \beta_{\text{Ligue 1-League (B/LL/SA/PL)}} = 0 \end{align}\)
\(\begin{align} H_a: \beta_{\text{Ligue 1-League (B/LL/SA/PL)}} \neq 0 \end{align}\)
Table 5 shows us that:
Of the 95% confidence intervals for the population difference in intercept between each league and Ligue 1, only the Bundesliga interval contains zero (Bundesliga: -0.15422, 0.15120; La Liga: 0.13469, 0.43934; Premier League: 0.72767, 1.02438; Serie A: 0.13158, 0.43075). This suggests that the difference between the Ligue 1 and Bundesliga intercepts could be zero, so it is possible that Ligue 1 and the Bundesliga have the same intercept. This is not true for the other leagues.
The p-values are very small (\(p\) < 0.001) for all leagues except the Bundesliga (\(p\) = 0.98455). This suggests that the difference in intercept is statistically significant (p-value < 0.05) for all leagues except the Bundesliga. For La Liga, Serie A, and the Premier League, we therefore reject the null hypothesis in favor of the alternative hypothesis, since there is evidence to suggest that the intercept for these leagues is significantly different from the Ligue 1 intercept. For the Bundesliga, we fail to reject the null hypothesis, since there is insufficient evidence to suggest that its intercept is significantly different from the intercept for Ligue 1.
So it appears that the differences in intercept are meaningfully different from 0 for all leagues but the Bundesliga. Hence, the Bundesliga and Ligue 1 intercepts are roughly equal, but the La Liga, Premier League, and Serie A intercepts differ from the Ligue 1 intercept. This is consistent with our observations from the visualization of the regression lines in Figure 8.
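The confidence intervals and p-values in Table 5 are two views of the same test, and one can be approximately recovered from the other. The Python sketch below backs out a standard error from each 95% interval and computes a two-sided p-value; it assumes a normal approximation with critical value 1.96 (reasonable for a sample of 1,910 players) and is not the exact t-based calculation R performs. The function name `wald_test` is my own.

```python
import math

# (estimate, conf.low, conf.high) for the league terms, copied from Table 5.
league_terms = {
    "Bundesliga": (-0.00151, -0.15422, 0.15120),
    "La Liga": (0.28702, 0.13469, 0.43934),
    "Premier League": (0.87602, 0.72767, 1.02438),
    "Serie A": (0.28117, 0.13158, 0.43075),
}

def wald_test(estimate, low, high, crit=1.96):
    """Approximate SE, z statistic, and two-sided p-value from a 95% CI."""
    se = (high - low) / (2 * crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return se, z, p

results = {lg: wald_test(*vals) for lg, vals in league_terms.items()}
for lg, (se, z, p) in results.items():
    print(f"{lg}: se={se:.4f}, z={z:.2f}, p={p:.5f}")
```

Under this approximation the Bundesliga p-value comes out at about 0.985, matching Table 5, while the other three leagues fall well below 0.001, in line with the conclusions above.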
3.5 Residual analysis
I conducted a residual analysis to see if there was any systematic pattern of residuals for the prediction model. This is because if there are systematic patterns of residuals, I cannot fully trust the confidence intervals and p-values used. Part of the residuals data can be seen in Table 6:
| log_value | Age | I(Age^2) | Gls/90 (20/21) | League | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|
| 18.78532 | 22 | 484 | 1.02 | Ligue 1 | 17.46948 | 1.31584 | 0.01659 | 1.03939 | 0.00344 | 1.27640 |
| 18.57768 | 21 | 441 | 1.01 | Bundesliga | 17.38085 | 1.19683 | 0.01640 | 1.03946 | 0.00281 | 1.16084 |
| 18.49764 | 28 | 784 | 0.67 | Premier League | 17.41952 | 1.07812 | 0.00782 | 1.03954 | 0.00107 | 1.04118 |
| 18.31532 | 21 | 441 | 0.35 | Premier League | 16.87461 | 1.44071 | 0.00487 | 1.03930 | 0.00118 | 1.38928 |
| 18.31532 | 29 | 841 | 0.64 | Premier League | 17.22641 | 1.08891 | 0.00731 | 1.03953 | 0.00102 | 1.05132 |
| 18.31532 | 29 | 841 | 0.57 | Ligue 1 | 16.20362 | 2.11170 | 0.00692 | 1.03870 | 0.00362 | 2.03840 |
The model residuals were approximately normally distributed, but with a number of potential outliers at the bottom end. There were no systematic patterns in the plots of residuals against the explanatory variables, but there were some outliers in age (some younger players), goals/90 min, and at the top end of Ligue 1 and the bottom end of the Premier League.
The fitted values plot showed a positive relationship to match the assumptions of the linear regression model. That said, it would be useful to repeat the analysis without the outliers to see if this changes the results.
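The columns of Table 6 are related to one another, which gives a simple way to sanity-check the residuals and to flag points for the outlier re-analysis suggested above. The Python sketch below uses only the six rows shown in Table 6; the |standardized residual| > 2 cutoff is a common rule of thumb that I am assuming here, not a threshold used in the original analysis.

```python
# (log_value, .fitted, .hat, .std.resid) for the six rows shown in Table 6.
rows = [
    (18.78532, 17.46948, 0.01659, 1.27640),
    (18.57768, 17.38085, 0.01640, 1.16084),
    (18.49764, 17.41952, 0.00782, 1.04118),
    (18.31532, 16.87461, 0.00487, 1.38928),
    (18.31532, 17.22641, 0.00731, 1.05132),
    (18.31532, 16.20362, 0.00692, 2.03840),
]

resids, sigmas, flagged = [], [], []
for i, (log_value, fitted, hat, std_resid) in enumerate(rows):
    resid = log_value - fitted  # should match the .resid column
    # std.resid = resid / (sigma * sqrt(1 - hat)), so sigma can be backed out
    sigma = resid / (std_resid * (1 - hat) ** 0.5)
    resids.append(resid)
    sigmas.append(sigma)
    if abs(std_resid) > 2:  # assumed rule-of-thumb cutoff
        flagged.append(i)

print([round(r, 5) for r in resids])
print([round(s, 4) for s in sigmas], flagged)
```

Every row implies a residual standard deviation of about 1.04 on the log scale, and only the last row shown (a 29-year-old Ligue 1 player) exceeds the cutoff, meaning his listed value is well above what the model predicts.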
4. Discussion
4.1 Conclusions
I found that (1) as age increased, player values increased or decreased significantly depending on the player's age; (2) as goals/90 minutes increased, player values increased significantly; and (3) there was a significant difference in player values across leagues. On average, for each additional year of age, player values increased (generally up to the age of 24/25) or decreased (beyond 24/25); for each additional goal/90 minutes a player scored, player values increased by 713.87% on average. This does not mean that age causes players to be worth less, or that goals/90 minutes cause a player to be worth more; rather, these variables are associated. It made sense to me that players in the Premier League are generally worth more than players in all other leagues, since it is commonly regarded as the most competitive and highest revenue-generating European football league, which likely makes the best players want to play there. But I was surprised to find in the exploratory plot (Figure 6) that Serie A player values were the next highest after the Premier League. I expected La Liga players to be next, and in Model 2 the two leagues are almost identical (33.25% and 32.47% above Ligue 1 for La Liga and Serie A respectively), so the raw ordering may be skewed slightly by the highest valued players.
Overall, these results suggest that age, goalscoring, and the league a player plays in are all factors in player value. My findings are consistent with previous studies showing that players peak when they are in their mid-20s [3]. Teams seeking to recruit talent should factor this in when buying players, and perhaps look to avoid paying extra for a player in the Premier League whose stats are similar to those of a player in Ligue 1. That said, given the inherent differences in levels of competition and cultural differences across leagues, we cannot extrapolate that a Ligue 1 player with the same profile as a Premier League player is necessarily “worth more”, only that he would be cheaper. Teams should also take this into consideration when building their squads from academies, for example by sending players on loan to countries where player values are higher, or when bringing in veteran talent who might be “undervalued” because people think they have passed the peak of their careers.
4.2 Limitations
There were several limitations to the data set. Firstly, 165 of the 2,075 players in the original dataset had missing values, so I had to exclude them. Furthermore, there were a number of outliers with extremely large values, which might skew the interpretations. The dataset is also limited to 2021 player values across the Big Five Leagues. As a result, the scope of inference is limited to these leagues, and the findings regarding age and goals/90 minutes cannot be generalized to players from other leagues.
It is also important to recognize that goals/90 minutes is an important metric for attackers and perhaps midfielders, but is not as useful when evaluating the value of defenders and goalkeepers. In this case, it would be useful to segment by position and add new explanatory variables for defenders or goalkeepers such as tackles and clean sheets.
4.3 Further Questions
If I were to continue researching the topic of player valuations in European football, I would like to use data that includes more leagues as well as updated values. As mentioned, the value of the football transfer market has risen steadily, and in recent years this has been accompanied by more data-led recruitment strategies. It would be interesting to see how such analyses have evolved over time and whether the Moneyball approach has changed the types of players the top leagues sign.
It would be useful to add more explanatory variables, such as assists, tackles, successful passes, yellow cards, and many others. The results from such deeper analysis could be used by football clubs to help inform their football projects and player recruitment and development strategies.
5. Citations and References
1 Eurosport (2023). “Global transfer record broken in summer 2023 as Premier League and Saudi Pro League splash out.” Eurosport. Retrieved from https://www.eurosport.com/football/transfers/2023-2024/global-transfer-record-broken-in-summer-2023-as-premier-league-and-saudi-pro-league-splash-out-recor_sto9770895/story.shtml
2 Varma, Sanjit. (2021). Predicting Soccer Player Transfer Values. Kaggle. https://www.kaggle.com/datasets/sanjitva/predicting-soccer-player-transfer-values
3 The Athletic (2021). “What age do players in different positions peak?” The Athletic. Retrieved from https://theathletic.com/2935360/2021/11/15/what-age-do-players-in-different-positions-peak/